Abstract

Introduction: A gene tree for a gene family is often discordant with the containing species tree because of its
complex evolutionary course, during which gene duplication, gene loss and incomplete lineage sorting events
might occur. Hence, inferring the containing species tree from a set of gene trees is challenging. One
common approach to this inference problem is gene tree and species tree reconciliation.

Results: In this paper, we generalize the traditional least common ancestor (LCA) reconciliation to define a 
reconciliation between a gene tree and species tree under the tree homomorphism framework. We then study the 
structural properties of the space of all reconciliations between a gene tree and a species tree in terms of the 
gene duplication, gene loss or deep coalescence costs. As an application, we show that the LCA reconciliation is the
unique one that has the minimum deep coalescence cost, provide a novel characterization of the reconciliations 
with the optimal duplication cost, and present efficient algorithms for enumerating (nearly-)optimal reconciliations 
with respect to each cost. 

Conclusions: This work provides a new graph-theoretic framework for studying gene tree and species tree 
reconciliations. 



Background 

With much higher speed than the traditional Sanger 
sequencing technology, the ultra-deep sequencing tech- 
nology has made huge amounts of molecular data avail- 
able for genomics study [1]. It provides an 
unprecedented opportunity to infer phylogenetic trees 
from multilocus and genomics data. One approach to 
inferring phylogeny from multilocus data is to recon- 
struct a gene tree from each locus and then to combine 
the resulting trees into a phylogeny, called the contain- 
ing species tree. Gene trees are often different since 
each gene family might undergo different mutational
events such as gene duplication and loss, horizontal 
gene transfer, and incomplete lineage sorting [2,3]. 
Therefore, the containing species tree is inferred from 
gene trees by reconciling it with each gene tree to mini- 
mize the total number of hypothetical evolutionary 
events that are responsible for the discordance between 
the trees. 

The gene tree and species tree reconciliation was first 
introduced by Goodman et al. [4] and formally defined 
by Page [5]. Given a gene tree for a gene family and a 
containing species tree, a reconciliation between them 
represents an evolutionary scenario of the gene family 
within the evolutionary history represented by the spe- 
cies tree [4]. To study gene duplication history, gene 
tree and species tree are reconciled to minimize the
number of gene duplications and/or losses. The mathe- 
matical and algorithmic issues of gene tree and species 
tree reconciliations have been intensively studied in the 
past decade [6-14]. For example, it has been shown that 
the so-called least common ancestor (LCA) reconcilia- 
tion has the minimum duplication and loss cost [9,15]. 

Although the LCA reconciliation is optimal in terms 
of the duplication cost, it may not represent the true 
evolution of the gene family being considered. Indeed, 
recent studies suggest that more than one reconciliation
may occur with the highest probability [16,17].
Such studies [3,14,17,18] in the stochastic framework 
assume that the discordance between a gene tree and a 
species tree is caused by incomplete lineage sorting and 
adopt Kingman's coalescent theory from population 
genetics [19]. 

The fact that the LCA reconciliation may not be the
unique optimal reconciliation with respect to the duplication cost
motivates researchers to study the space of all the 
reconciliations and develop algorithms to enumerate 
nearly-optimal reconciliations for a species tree and a 
gene tree [20,21]. In this paper, we take a different 
approach to these two issues. We generalize the LCA 
reconciliation to define an arbitrary reconciliation as a 
vertex-mapping from a gene tree to a species tree that 
preserves the hierarchical structure of the gene tree. 
Our approach is essentially different from the existing 
ones [20,22], where the specific mutation events are 
used and a gene tree vertex is mapped to a species 
tree branch to specify a duplication event. One advantage
of our approach over the others is that we separate
the concept of reconciliation from the cost models that
are used to measure the tree discordance. Because of
this, we are able to study the structural properties of 
the space of all reconciliations between a gene tree 
and a species tree in the same manner for each of the 
three cost models. We show that the LCA reconcilia- 
tion has not only the minimum duplication and loss 
cost [9,15], but also the minimum deep coalescence 
cost. We also present a novel characterization of the 
reconciliations with the optimal duplication cost, and 
develop efficient algorithms for enumerating (nearly-) 
optimal reconciliations with respect to each cost 
model. 

Methods 

Basic notations 

Species evolve from their common ancestor through a 
series of speciation events. A species tree represents the 
evolutionary history of a set of species. A gene family 
might evolve from its common ancestral gene through 
gene duplication and loss events. Here we will assume 
that no lateral gene transfer has occurred. 



Both gene and species trees are rooted trees with 
labeled leaves. In a species tree, a leaf x represents a 
species, the label of x. Hence, the species tree is 
uniquely leaf-labeled. In a gene tree, a leaf y represents 
a gene found in a species. To infer the duplication history
of a gene family, its gene tree and the containing
species tree are reconciled [4]. For this purpose, a leaf of
a gene tree is labeled with the containing species. Since
a species may contain duplicate genes, two leaves in a 
gene tree can have the same label. 

Let $T$ be a species or gene tree; its vertex set and edge
set are denoted by $V(T)$ and $E(T)$, respectively. Given
two vertices $u$ and $v$ in $T$, there exists a unique path
$P(u, v)$ from $u$ to $v$. The number of edges in $P(u, v)$,
denoted by $d(u, v)$, is called the distance between $u$ and
$v$. Note that $d(u, v) = 0$ if and only if $u = v$. The node $v$
is a descendant of $u$, or $u$ is an ancestor of $v$, denoted by
$v \le u$, if $u$ is on the unique path from $r(T)$, the root of
$T$, to $v$. For simplicity, we also write $v < u$ if $v \le u$ and
$v \ne u$. Given a set $A$ of vertices in $T$, $u$ is a common
ancestor of $A$ if and only if $v \le u$ for every $v \in A$. In
addition, if $u \le u'$ for any other common ancestor $u'$ of
$A$, then we say $u$ is the least common ancestor of $A$,
written as $\mathrm{lca}(A)$, or $\mathrm{lca}(u_1, \ldots, u_k)$ if $A = \{u_1, \ldots, u_k\}$.

For each vertex $u$ in $T$ with $u \ne r(T)$, the parent of $u$,
denoted by $p(u)$, is the unique vertex in $T$ that is adjacent
to $u$ and contained on the path from $r(T)$ to $u$. In
this case, $u$ is also called a child of $p(u)$. The out-degree
of $u$, denoted by $d(u)$, is defined as the number of the
children of $u$. Obviously, a node is a leaf if and only if
its out-degree is 0. Non-leaf nodes are internal nodes;
they form a subset $V^\circ(T)$ of $V(T)$. If every internal vertex
has out-degree two, then $T$ is binary. For an internal
vertex $u$ in a binary tree, its two children are denoted
by $u_1$ and $u_2$, unless stated otherwise. In this study, we
will focus on the case that gene trees and species trees
are binary. For a vertex $u$, we use $L(u)$ to denote the set
of the labels of its leaf descendants and call it the cluster
induced by $u$. Finally, we use $L(T)$ to denote the set of
leaf labels, i.e., the cluster induced by the root of $T$.

Reconciliation between gene tree and species tree 

Let $S$ be a species tree over a set of species and $G$ a
gene tree such that $L(G) \subseteq L(S)$, i.e., $G$ is over all the
homologous genes of a gene family found in some species.
A map $f$ from $V(G)$ to $V(S)$ is order-preserving if
for each pair of vertices $u, v$ in $G$, $u \le v$ implies $f(u) \le f(v)$;
it is leaf-preserving if, for each leaf $x$ in $G$, $f(x)$ is the
unique leaf in $S$ that has the same label.

A reconciliation between a gene tree $G$ and a species
tree $S$ is a leaf-preserving and order-preserving map
from $V(G)$ to $V(S)$. Clearly, a reconciliation $f$ between $G$
and $S$ is necessarily an inclusion-preserving mapping
(see [8]), that is, for each pair of vertices $u, v$ in $G$,
$u \le v$ implies $L(f(u)) \subseteq L(f(v))$. However, the reverse
statement is not true. For instance, the mapping that maps
each vertex of $G$ to the root of $S$ is an inclusion-preserving
mapping, but according to our definition, it is not
leaf-preserving, and hence not a reconciliation.
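
The two defining conditions can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the trees are a hypothetical toy example (not the trees of Figure 1), the species tree is stored as a child-to-parent dictionary, the gene tree as a parent-to-children dictionary, and the helper names `leq` and `is_reconciliation` are ours rather than the paper's.

```python
# Hypothetical toy data: species tree ((a,b)X,c)R as a child -> parent map.
S_PARENT = {"a": "X", "b": "X", "X": "R", "c": "R", "R": None}
# Gene tree ((g1,g2)u,(g3,g4)v)r as parent -> children, plus leaf labels.
G_CHILDREN = {"r": ("u", "v"), "u": ("g1", "g2"), "v": ("g3", "g4")}
G_LABEL = {"g1": "a", "g2": "b", "g3": "a", "g4": "c"}


def leq(s_parent, x, y):
    """Return True if x <= y in the species tree (y is x or an ancestor of x)."""
    while x is not None:
        if x == y:
            return True
        x = s_parent[x]
    return False


def is_reconciliation(f, g_children, g_label, s_parent):
    """Check that the map f is leaf-preserving and order-preserving."""
    # Leaf-preserving: each gene leaf goes to the species leaf with its label.
    if any(f[leaf] != label for leaf, label in g_label.items()):
        return False
    # Order-preserving: checking every parent/child pair of the gene tree
    # suffices, because <= on the species tree is transitive.
    return all(
        leq(s_parent, f[child], f[u])
        for u, kids in g_children.items()
        for child in kids
    )
```

In this toy setting, the map sending every vertex of the gene tree to the root `R` is rejected because it is not leaf-preserving, in line with the remark above.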

Note that our definition is consistent with the one
used in [20], where a reconciliation is defined as a
mapping from $V(G)$ to $V(S) \cup E(S)$ that satisfies three
constraints: base constraint, tree mapping constraint
and ancestor consistency constraint. Roughly speaking,
our order-preserving condition corresponds to the
ancestor consistency constraint, and the leaf-preserving
condition is related to the base constraint, while
the tree mapping constraint is not needed in our
setting. The main difference between these two frameworks
is the model used to interpret mappings. For
example, in [20], a duplication event is associated to a
vertex $v$ in $G$ if and only if $v$ is mapped to an edge,
while in our model, whether $v$ is associated with a
duplication event is not solely determined by the
image of $v$.

A reconciliation represents a hypothetical evolutionary
history of the gene family. In a gene tree, an internal
vertex $u$ represents the common ancestor of the genes
represented by the leaves below it. The order-preserving
property reflects the intuitive fact that $u$ is an ancient gene
appearing in some common ancestor of the species
from which the genes are taken. Recall that in a species
tree each branch represents an ancestral species. Under
the reconciliation $f$, we consider $u$ as the ancestral gene
found in the species represented by the branch
entering $f(u)$.

There is a canonical partial order $\preceq$ on the set of
reconciliations between $G$ and $S$: for any $f$ and $f'$, $f' \preceq f$ if
and only if $f'(v) \le f(v)$ holds for every vertex $v$ in $G$.
Define a mapping $M$ from $G$ to $S$ recursively as:

\[
M(u) := \begin{cases}
\text{the unique leaf with the same label}, & \text{if } u \text{ is a leaf}; \\
\mathrm{lca}(M(u_1), M(u_2)), & \text{otherwise}.
\end{cases}
\]

$M$ is called the least common ancestor (LCA) reconciliation
between $G$ and $S$. Note that we have $M \preceq f$ for
every reconciliation $f$ between $G$ and $S$, because it is
easy to see that $M(u) \le f(u)$ holds for all $u \in V(G)$, by a
bottom-up traversal.
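
The recursive definition of $M$ translates directly into a bottom-up computation. The sketch below is a minimal illustration, assuming the same hypothetical toy trees and dictionary encoding as before; the naive `lca` helper walks up the species tree and is not one of the linear-time LCA algorithms from the literature.

```python
# Hypothetical toy data, not the trees of Figure 1.
S_PARENT = {"a": "X", "b": "X", "X": "R", "c": "R", "R": None}
G_CHILDREN = {"r": ("u", "v"), "u": ("g1", "g2"), "v": ("g3", "g4")}
G_LABEL = {"g1": "a", "g2": "b", "g3": "a", "g4": "c"}


def lca(s_parent, x, y):
    """Least common ancestor of x and y in the species tree (naive walk)."""
    ancestors = set()
    while x is not None:
        ancestors.add(x)
        x = s_parent[x]
    while y not in ancestors:
        y = s_parent[y]
    return y


def lca_reconciliation(g_children, g_label, s_parent, root):
    """Compute M: leaves go to their labels, internal vertices to the lca of their children's images."""
    m = dict(g_label)  # leaf-preserving part of M

    def visit(u):
        if u not in m:  # internal vertex not yet mapped
            u1, u2 = g_children[u]
            m[u] = lca(s_parent, visit(u1), visit(u2))
        return m[u]

    visit(root)
    return m


M = lca_reconciliation(G_CHILDREN, G_LABEL, S_PARENT, "r")
```

For this toy input, `u` is mapped to `X`, while `v` and `r` are both mapped to the species root `R`.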

Inference of gene duplications 

If the discord of a gene tree $G$ and its containing species
tree $S$ is due to gene duplication, a reconciliation $f$ between
them represents a plausible duplication history of genes.
For an internal vertex $u$, a duplication event is associated
with $u$ if and only if one of the following two conditions
holds: (D-i) $f(u) = f(u_1)$, $f(u) = f(u_2)$, or both hold;
(D-ii) $P(f(u), f(u_1))$ and $P(f(u), f(u_2))$ contain a common edge. In the
literature (see [5]), when the LCA reconciliation $M$ is used
for inferring gene duplications, the duplication condition
used is (D-i). This is correct for the LCA reconciliation
between a gene tree and a species tree. However, this
stringent condition is no longer appropriate as the definition
of duplication events for arbitrary reconciliations. For
example, consider the reconciliation $f$ between the gene
tree $G$ and the species tree $S$ as in Figure 1. If the original
definition is used, as proposed in [8], only one duplication
is inferred, which is associated with $r$. However, one duplication
cannot produce such a gene family having the gene
tree $G$. On the other hand, if our proposed definition is
used, two duplications are inferred, one associated with $r$
and the other with $b$; the implied duplication scenario is
given in Figure 2.

[Figure 2: the duplication scenario implied by the reconciliation $f$; filled squares mark duplications.]

Now, for an internal node $u$, we let $\delta_f(u) = 1$ if there is
a duplication event associated with it, and $\delta_f(u) = 0$
otherwise (with the convention $\delta_f(x) = 0$ for every leaf $x$).
Then the gene duplication cost $\mathrm{gd}(f)$ of $f$ is
defined as:

\[
\mathrm{gd}(f) := \sum_{u \in V(G)} \delta_f(u). \tag{1}
\]
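
As an illustration of conditions (D-i) and (D-ii), the following sketch evaluates $\delta_f$ and $\mathrm{gd}(f)$ directly from the definition. It assumes a hypothetical toy pair of trees (not those of Figure 1) and that $f$ is a valid reconciliation, so that $f(u)$ is an ancestor of the images of both children; the function names are ours.

```python
# Hypothetical toy data, not the trees of Figure 1.
S_PARENT = {"a": "X", "b": "X", "X": "R", "c": "R", "R": None}
G_CHILDREN = {"r": ("u", "v"), "u": ("g1", "g2"), "v": ("g3", "g4")}


def path_edges(s_parent, top, bottom):
    """Edges (parent, child) on the path from `top` down to `bottom`.

    Assumes `top` is `bottom` or an ancestor of it, as in a reconciliation.
    """
    edges = set()
    while bottom != top:
        edges.add((s_parent[bottom], bottom))
        bottom = s_parent[bottom]
    return edges


def delta(f, u, g_children, s_parent):
    """Return 1 if a duplication is associated with internal vertex u, else 0."""
    u1, u2 = g_children[u]
    if f[u] in (f[u1], f[u2]):  # condition (D-i)
        return 1
    shared = path_edges(s_parent, f[u], f[u1]) & path_edges(s_parent, f[u], f[u2])
    return 1 if shared else 0  # condition (D-ii)


def gd(f, g_children, s_parent):
    """Gene duplication cost: number of internal vertices with a duplication."""
    return sum(delta(f, u, g_children, s_parent) for u in g_children)


LEAVES = {"g1": "a", "g2": "b", "g3": "a", "g4": "c"}
f_lca = dict(LEAVES, u="X", v="R", r="R")   # the LCA reconciliation
f_up = dict(LEAVES, u="R", v="R", r="R")    # u moved up to the species root
```

In this toy case the LCA reconciliation `f_lca` has one duplication (at `r`), whereas `f_up` has two: condition (D-ii) fires at `u` because both paths from `R` pass through the edge `(R, X)`.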



Gene loss cost 

Let $G$ be a gene tree, $S$ a species tree and $f$ a reconciliation
between $G$ and $S$. Then the number of losses $l_f(u)$
associated to an internal vertex $u$ is defined as:

\[
l_f(u) := \begin{cases}
d(f(u), f(u_1)) + d(f(u), f(u_2)), & \text{if } \delta_f(u) = 1, \\
d(f(u), f(u_1)) + d(f(u), f(u_2)) - 2, & \text{otherwise}.
\end{cases}
\]

Note that our definition of $l_f(u)$ is a generalization
of the one introduced by Ma et al. in [6], and is consistent
with the one in [20]. When $f$ is the LCA
reconciliation, our definition agrees with the traditional
one [5,6]. For later use, it is often convenient
to combine the two formulae in the above definition,
i.e., we have:

\[
l_f(u) = d(f(u), f(u_1)) + d(f(u), f(u_2)) + 2 \cdot (\delta_f(u) - 1). \tag{2}
\]

For simplicity, we also set $l_f(x) = 0$ for any leaf $x$ of $G$.
The gene loss cost $\mathrm{gl}(f)$ of $f$ is defined as:

\[
\mathrm{gl}(f) := \sum_{u \in V(G)} l_f(u). \tag{3}
\]

For example, for the reconciliation $f$ in Figure 1, we
have $\mathrm{gl}(f) = 7$ by noting that:

\[
l_f(a) = l_f(c) = l_f(r) = 1 \quad \text{and} \quad l_f(b) = 4,
\]

which can also be observed from Figure 2.

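
Formula (2) lends itself to a short computation. The sketch below is an illustration only: it uses a hypothetical toy pair of trees (not Figure 1), assumes $f$ is a valid reconciliation, and evaluates $\delta_f$ through the lca-based test of Lemma 1 (i) stated later in the paper; helper names are ours.

```python
# Hypothetical toy data, not the trees of Figure 1.
S_PARENT = {"a": "X", "b": "X", "X": "R", "c": "R", "R": None}
G_CHILDREN = {"r": ("u", "v"), "u": ("g1", "g2"), "v": ("g3", "g4")}


def dist(s_parent, top, bottom):
    """Number of edges from `top` down to `bottom` (top must be an ancestor or equal)."""
    steps = 0
    while bottom != top:
        bottom = s_parent[bottom]
        steps += 1
    return steps


def lca(s_parent, x, y):
    """Least common ancestor of x and y in the species tree (naive walk)."""
    ancestors = set()
    while x is not None:
        ancestors.add(x)
        x = s_parent[x]
    while y not in ancestors:
        y = s_parent[y]
    return y


def delta(f, u, g_children, s_parent):
    """Duplication indicator via Lemma 1 (i); valid when f is a reconciliation."""
    a, b = (f[c] for c in g_children[u])
    return int(f[u] in (a, b) or lca(s_parent, a, b) != f[u])


def losses(f, u, g_children, s_parent):
    """l_f(u) following formula (2)."""
    u1, u2 = g_children[u]
    return (dist(s_parent, f[u], f[u1]) + dist(s_parent, f[u], f[u2])
            + 2 * (delta(f, u, g_children, s_parent) - 1))


def gl(f, g_children, s_parent):
    """Gene loss cost: leaves contribute 0, so only internal vertices are summed."""
    return sum(losses(f, u, g_children, s_parent) for u in g_children)


LEAVES = {"g1": "a", "g2": "b", "g3": "a", "g4": "c"}
f_lca = dict(LEAVES, u="X", v="R", r="R")   # the LCA reconciliation
f_up = dict(LEAVES, u="R", v="R", r="R")    # u moved up to the species root
```

On this toy input the loss cost is 2 for `f_lca` and 5 for `f_up`, so moving one vertex upwards strictly increases the loss cost here.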
Deep coalescence cost 

If the discord of a gene tree $G$ and a species tree $S$ is
due to incomplete lineage sorting, a reconciliation $f$
between them is measured by the deep coalescence cost
[3]. Given a branch $e$ in $S$, we say that there are $k$ ($k \ge 0$)
extra lineages (with respect to $f$) failing to coalesce on
$e$, denoted by $\tau_f(e) = k$, if there exist exactly $k + 1$ distinct edges
$(u_i, v_i)$ ($1 \le i \le k + 1$) in $G$ such that $e$ is on the path
$P(f(u_i), f(v_i))$ for each $i$; otherwise, we let $\tau_f(e) = 0$. The deep
coalescence cost $\mathrm{dc}(f)$ of $f$ is then defined as:

\[
\mathrm{dc}(f) := \sum_{e \in E(S)} \tau_f(e),
\]

i.e., the total number of the extra lineages with respect
to $f$ on all branches of $S$. For example, for the reconciliation
$f$ in Figure 1, we have $\mathrm{dc}(f) = 3$ by noting that:

\[
\tau_f((R, X)) = \tau_f((X, Y)) = \tau_f((Y, Z)) = 1.
\]

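
The extra-lineage count can be sketched by tallying, for each species-tree edge, how many gene-tree edges are routed through it. The code below is a minimal illustration on a hypothetical toy pair of trees (not Figure 1), assuming $f$ is a valid reconciliation; the function names are ours.

```python
from collections import Counter

# Hypothetical toy data, not the trees of Figure 1.
S_PARENT = {"a": "X", "b": "X", "X": "R", "c": "R", "R": None}
G_CHILDREN = {"r": ("u", "v"), "u": ("g1", "g2"), "v": ("g3", "g4")}


def path_edges(s_parent, top, bottom):
    """Edges (parent, child) on the path from `top` down to `bottom`.

    Assumes `top` is `bottom` or an ancestor of it, as in a reconciliation.
    """
    edges = set()
    while bottom != top:
        edges.add((s_parent[bottom], bottom))
        bottom = s_parent[bottom]
    return edges


def deep_coalescence(f, g_children, s_parent):
    """Sum over species edges of (number of gene lineages on the edge) - 1."""
    lineages = Counter()
    for u, kids in g_children.items():
        for child in kids:
            # Each gene-tree edge (u, child) adds one lineage to every
            # species-tree edge on the path from f(u) to f(child).
            lineages.update(path_edges(s_parent, f[u], f[child]))
    # Edges carrying no lineage never enter the counter, so n >= 1 here.
    return sum(n - 1 for n in lineages.values())


LEAVES = {"g1": "a", "g2": "b", "g3": "a", "g4": "c"}
f_lca = dict(LEAVES, u="X", v="R", r="R")   # the LCA reconciliation
f_up = dict(LEAVES, u="R", v="R", r="R")    # u moved up to the species root
```

For the toy trees, `f_lca` yields a cost of 2 and `f_up` a cost of 3; the difference of 1 equals the distance by which `u` was moved.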
Results 

The monotonicity of the reconciliation costs 

We first have the following useful observations on the 
gene duplication cost. 

Lemma 1 Let $f$ be a reconciliation between a gene tree
$G$ and a species tree $S$. If $u$ is an internal vertex in $G$
with children $u_1$ and $u_2$, then the following observations
hold.

(i): $\delta_f(u) = 1$ if and only if $f(u) \in \{f(u_1), f(u_2)\}$ or
$\mathrm{lca}(f(u_1), f(u_2)) < f(u)$.

(ii): $\delta_f(u) = 0$ if and only if $f(u_1) \ne f(u) \ne f(u_2)$ and
$\mathrm{lca}(f(u_1), f(u_2)) = f(u)$.

(iii): If $L(f(u_1)) \cap L(f(u_2)) \ne \emptyset$, then $\delta_f(u) = 1$.

(iv): If $f(u) > M(u)$, then $\delta_f(u) = 1$.

(v): If $\delta_f(u) = 0$, then $f(u) = M(u)$ and
$L(f(u_1)) \cap L(f(u_2)) = \emptyset$.

Proof: Since $\delta_f(u)$ is either 0 or 1, (ii) clearly follows
from (i), and (v) follows from (iii) and (iv).

To establish (i), it suffices to show that $\mathrm{lca}(f(u_1), f(u_2)) < f(u)$
if and only if $P(f(u), f(u_1))$ and $P(f(u), f(u_2))$ share a
common edge. Indeed, if we have $\mathrm{lca}(f(u_1), f(u_2)) < f(u)$,
then $P(f(u), f(u_1))$ and $P(f(u), f(u_2))$ share the edge that is
incident to $\mathrm{lca}(f(u_1), f(u_2))$ and its parent. On the other
hand, if $P(f(u), f(u_1))$ and $P(f(u), f(u_2))$ share a common
edge $(s, s')$ with $s' < s$, then $s'$ is a common ancestor of
$f(u_1)$ and $f(u_2)$ such that $s' < s \le f(u)$. Therefore we have
$\mathrm{lca}(f(u_1), f(u_2)) \le s' < f(u)$, as required.

Now we proceed to prove (iii). If $L(f(u_1)) \cap L(f(u_2)) \ne \emptyset$,
then we have either $f(u_1) \le f(u_2)$ or $f(u_2) \le f(u_1)$. By
symmetry, we may assume $f(u_1) \le f(u_2)$, and hence
$\mathrm{lca}(f(u_1), f(u_2)) = f(u_2)$ holds. Now there are two cases to be
considered, i.e., $f(u_2) = f(u)$ and $f(u_2) < f(u)$. By (i), we can
conclude $\delta_f(u) = 1$ in both of them.

It remains to show (iv). Note first that we can
assume $\mathrm{lca}(f(u_1), f(u_2)) > M(u)$, because otherwise we
have $\mathrm{lca}(f(u_1), f(u_2)) = M(u) < f(u)$, and hence $\delta_f(u) = 1$
by (i). It follows that $f(u_i) > M(u)$ for some $i = 1, 2$.
Therefore, by switching $u_1$ and $u_2$ if necessary, we can
further assume $f(u_1) > M(u)$. Now we need to consider
two cases: $f(u_2) > M(u)$ and $f(u_2) \le M(u)$. If $f(u_2) > M(u)$,
then $f(u_1)$ and $f(u_2)$ are both contained in the path
$P(f(u), M(u))$, and thus $L(f(u_1)) \cap L(f(u_2)) \ne \emptyset$ holds. On
the other hand, $f(u_2) \le M(u)$ implies $f(u_2) < f(u_1)$, and
hence also $L(f(u_1)) \cap L(f(u_2)) \ne \emptyset$. Since in both cases
we have $L(f(u_1)) \cap L(f(u_2)) \ne \emptyset$, by (iii) we obtain
$\delta_f(u) = 1$, as required. Q.E.D.

Note that (i) in the above lemma provides an additional
characterization of gene duplication events.
This characterization is easier for calculation while
the original definition is more natural, from an evolutionary
point of view. By (v) in the above lemma, if a
speciation event happens at $u$, i.e., $\delta_f(u) = 0$, then we
have $f(u) = M(u)$. This agrees with the definition of
reconciliation in [20]. Now we have the following
main result.

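Theorem 2 Let $f$ and $f'$ be two distinct reconciliations
between a gene tree $G$ and a species tree $S$ with $f' \preceq f$;
then we have:

\[
\mathrm{gd}(f') \le \mathrm{gd}(f), \quad \mathrm{gl}(f') < \mathrm{gl}(f), \quad \text{and} \quad \mathrm{dc}(f') < \mathrm{dc}(f). \tag{4}
\]

In addition, $\delta_{f'}(u) \le \delta_f(u)$ for each $u \in V(G)$, where the
equality holds for each $u \in V(G)$ if and only if
$\mathrm{gd}(f) = \mathrm{gd}(f')$.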

Proof Let $D(f, f')$ be the number of vertices $v$ in $V(G)$
with $f(v) \ne f'(v)$; then the following observation plays an
important role in our proof of the theorem.

Lemma 3 Let $f$ and $f'$ be the two reconciliations as
given in the theorem. Then there exists a reconciliation $f^*$
between $G$ and $S$ that satisfies the following three conditions:

\[
f' \preceq f^* \preceq f, \quad D(f', f^*) = D(f', f) - 1 \quad \text{and} \quad D(f^*, f) = 1.
\]

Proof To establish the above lemma, we select a minimal
element $v_{\min}$ (with respect to the partial order $\le$ on
$V(G)$) in the set $\{u \in V(G) : f(u) \ne f'(u)\}$, which is necessarily
non-empty by the assumption $f \ne f'$. In other
words, $f(v) = f'(v)$ holds for any $v$ such that $v < v_{\min}$.
Now consider the map $f^*$ defined as:

\[
f^*(v) := \begin{cases}
f'(v), & \text{if } v = v_{\min}, \\
f(v), & \text{otherwise}.
\end{cases}
\]

Then $f^*$ is a reconciliation between $G$ and $S$. To see
this, note first that $f$ and $f'$ are reconciliations, and
hence they are leaf-preserving. Therefore we know $f^*$ is
also leaf-preserving. Let $u$ and $v$ be a pair of vertices in
$G$ with $u \le v$. If $u, v \in V(G) - \{v_{\min}\}$, then $f^*(u) \le f^*(v)$
because $f$ is order-preserving. On the other hand, if
$u = v_{\min}$ and $v \ne u$, then we also have:

\[
f^*(v_{\min}) = f'(v_{\min}) \le f'(v) \le f(v) = f^*(v),
\]

where we use the fact that $f'$ is order-preserving and
$f' \preceq f$ in the first and second inequality, respectively.

Finally, suppose $v = v_{\min}$ and $u \ne v$. By the way that
$v_{\min}$ is chosen, we have $f(u) = f'(u)$, and hence also:

\[
f^*(u) = f(u) = f'(u) \le f'(v_{\min}) = f^*(v_{\min}).
\]

This shows $f^*$ is order-preserving, and hence $f^*$ is
indeed a reconciliation between $G$ and $S$.

It remains to show that $f^*$ satisfies the three conditions
required in the claim. Since $f' \preceq f$, from the construction
of $f^*$ we have $f' \preceq f^* \preceq f$. Noting that $v_{\min}$ is the only vertex
in $V(G)$ that is mapped to different images by $f$ and
$f^*$, we have $D(f^*, f) = 1$. Finally, for any $v$ in $G$, $f^*(v) \ne f'(v)$
if and only if $v \ne v_{\min}$ and $f(v) \ne f'(v)$. In other
words, we have $D(f', f^*) = D(f', f) - 1$, which completes
the proof of Lemma 3. Q.E.D.

Now it suffices to prove the theorem for the special
case $D(f, f') = 1$. Indeed, if $D(f, f') = m > 1$, then by
Lemma 3, there exist $m + 1$ reconciliations $f_1 := f', f_2, \ldots, f_{m+1} := f$
so that $f_i \preceq f_{i+1}$ and $D(f_i, f_{i+1}) = 1$ for $1 \le i \le m$.
Applying the theorem (in the special case mentioned
above) for each pair of reconciliations $f_i$ and $f_{i+1}$,
we have $\mathrm{gd}(f_i) \le \mathrm{gd}(f_{i+1})$ for $1 \le i \le m$, and hence
$\mathrm{gd}(f') = \mathrm{gd}(f_1) \le \mathrm{gd}(f_{m+1}) = \mathrm{gd}(f)$. Similarly, we can
show $\mathrm{gl}(f') < \mathrm{gl}(f)$, $\mathrm{dc}(f') < \mathrm{dc}(f)$, and $\delta_{f'}(u) \le \delta_f(u)$ for
each $u \in V(G)$, among which the last one implies that
$\mathrm{gd}(f) = \mathrm{gd}(f')$ if and only if $\delta_f(u) = \delta_{f'}(u)$ for each $u \in V(G)$.

Now let $v$ be the unique vertex in $G$ with $f(v) \ne f'(v)$.
Clearly, $v$ is an internal vertex, with children $v_1$ and $v_2$.
If $v$ is not the root, let $v_0 := p(v)$ be its parent and $v_3$ be its
sibling, that is, the other child of $v_0$. The remaining argument
is divided into three cases, according to the cost measure
considered.

Duplication cost case 

Noting that $f(v_i) = f'(v_i) \le f'(v) < f(v)$ for $i = 1, 2$, we have:

\[
\mathrm{lca}(f(v_1), f(v_2)) \le f'(v) < f(v).
\]

By (i) in Lemma 1, this shows $\delta_f(v) = 1$, and hence
$\delta_f(v) \ge \delta_{f'}(v)$. If $v$ is the root of $G$, then we have
$\mathrm{gd}(f) - \mathrm{gd}(f') = \delta_f(v) - \delta_{f'}(v) \ge 0$, as required.

Now we assume $v$ is not the root, and proceed to
show $\delta_f(v_0) \ge \delta_{f'}(v_0)$. To begin with, we can assume
$\delta_{f'}(v_0) = 1$, because otherwise the inequality trivially holds.
In addition, we can further assume $f(v) < f(v_0)$ and $f(v_3) < f(v_0)$,
because otherwise we have $\delta_f(v_0) = 1$, which also
implies the inequality. It follows that we have:

\[
\delta_{f'}(v_0) = 1, \quad f'(v) < f(v) < f(v_0) = f'(v_0) \quad \text{and} \quad f'(v_3) = f(v_3) < f(v_0) = f'(v_0).
\]

By (i) in Lemma 1, this leads to $\mathrm{lca}(f'(v), f'(v_3)) < f(v_0) = f'(v_0)$.
Let $s$ be the child of $f(v_0)$ so that $\mathrm{lca}(f'(v), f'(v_3)) \le s$.
Since $f'(v) < f(v) < f(v_0)$ and $f'(v_3) = f(v_3)$, $s$ is also a
common ancestor of $f(v)$ and $f(v_3)$. Therefore we have
$\mathrm{lca}(f(v), f(v_3)) \le s < f(v_0)$. Using (i) in Lemma 1 again,
we can conclude $\delta_f(v_0) = 1$, as required.

Since $v$ is the only vertex in $G$ with $f(v) \ne f'(v)$, for
each internal vertex $g \in V(G) - \{v, v_0\}$ and its two children
$g_1$ and $g_2$, we have:

\[
f(g) = f'(g), \quad f(g_1) = f'(g_1) \quad \text{and} \quad f(g_2) = f'(g_2).
\]

By definition, this implies $\delta_f(g) = \delta_{f'}(g)$ for all
$g \in V(G) - \{v, v_0\}$. Combining the above observations, we can
conclude that $\delta_f(u) \ge \delta_{f'}(u)$ for each $u \in V^\circ(G)$. This
leads to $\mathrm{gd}(f) \ge \mathrm{gd}(f')$, where the equality holds if and
only if $\delta_f(u) = \delta_{f'}(u)$ for each $u \in V(G)$.

Gene loss case 

Since $f(v_i) \le f'(v) < f(v)$ holds for $i = 1, 2$, we have:

\[
d(f(v), f(v_i)) = d(f(v), f'(v)) + d(f'(v), f(v_i)) \quad \text{for } i = 1, 2. \tag{5}
\]

Together with the definition of $l_f$, we obtain:

\[
\begin{aligned}
l_f(v) - l_{f'}(v) &= d(f(v), f(v_1)) + d(f(v), f(v_2)) + 2 \cdot (\delta_f(v) - 1) \\
&\quad - d(f'(v), f(v_1)) - d(f'(v), f(v_2)) - 2 \cdot (\delta_{f'}(v) - 1) \\
&= 2 d(f(v), f'(v)) + 2 \cdot (\delta_f(v) - \delta_{f'}(v)).
\end{aligned}
\]

Since $\delta_f(v) \ge \delta_{f'}(v)$, following the proof of the duplication
cost case, and $d(f(v), f'(v)) > 0$, we can conclude
that $l_f(v) - l_{f'}(v) > 0$. If $v$ is the root of $G$, then this leads
to $\mathrm{gl}(f) > \mathrm{gl}(f')$, as required.

Now we assume $v$ is not the root of $G$. Then we have:

\[
\begin{aligned}
l_f(v_0) - l_{f'}(v_0) &= d(f(v_0), f(v)) + d(f(v_0), f(v_3)) + 2 \cdot (\delta_f(v_0) - 1) \\
&\quad - d(f(v_0), f'(v)) - d(f(v_0), f(v_3)) - 2 \cdot (\delta_{f'}(v_0) - 1) \\
&= -d(f(v), f'(v)) + 2 \cdot (\delta_f(v_0) - \delta_{f'}(v_0)),
\end{aligned}
\]

where we use the observation that $f'(v) < f(v) \le f(v_0)$
implies:

\[
d(f(v_0), f'(v)) = d(f(v_0), f(v)) + d(f(v), f'(v)).
\]

Combining these results, we have:

\[
\mathrm{gl}(f) - \mathrm{gl}(f') = l_f(v_0) + l_f(v) - l_{f'}(v_0) - l_{f'}(v) = d(f(v), f'(v)) + 2 \cdot (\mathrm{gd}(f) - \mathrm{gd}(f')).
\]

Since $\mathrm{gd}(f) \ge \mathrm{gd}(f')$, following the proof of the duplication
cost case, and $d(f(v), f'(v)) > 0$, we obtain $\mathrm{gl}(f) > \mathrm{gl}(f')$,
which completes the proof of this case.

Deep coalescence case 

Let $E_f(S)$ be the set of edges $e$ in $S$ such that there exists
an edge $(u, u')$ in $G$ such that $e$ is contained in the directed
path from $f(u)$ to $f(u')$. Now by counting extra
lineages in terms of the edges contained in paths that
have form $P(f(u), f(u'))$ for some edge $(u, u')$ in $G$, we
have:

\[
\mathrm{dc}(f) = -|E_f(S)| + \sum_{(u, u') \in E(G)} d(f(u), f(u')). \tag{6}
\]

Since $E_f(S) = E_{f'}(S)$ and $f(u) = f'(u)$ for $u \ne v$, the above
formula implies:

\[
\mathrm{dc}(f) - \mathrm{dc}(f') = d(f(v_0), f(v)) - d(f(v_0), f'(v)) + \sum_{i=1}^{2} \bigl( d(f(v), f(v_i)) - d(f'(v), f(v_i)) \bigr) = d(f(v), f'(v)) > 0,
\]

if $v$ is not the root of $G$. Here in the second equality
we use the observation that $f(v)$ is on the directed path
from $f(v_0)$ to $f'(v)$, and for $i = 1, 2$, $f'(v)$ is on the directed
path from $f(v)$ to $f(v_i)$. If $v$ is the root of $G$, then a similar
argument leads to:

\[
\mathrm{dc}(f) - \mathrm{dc}(f') = 2 \cdot d(f(v), f'(v)) > 0,
\]

which completes the proof. Q.E.D.

Since the LCA reconciliation is the minimal element 
in the space of reconciliations, the above theorem leads 
directly to the following result. 

Corollary 4 Among all reconciliations between a gene
tree $G$ and a species tree $S$, the LCA reconciliation (a) has
the minimum gene duplication cost [9], and (b) is the unique
one with the optimal gene loss cost [15] and the optimal
deep coalescence cost.

Note that there is a close relationship among the gene
duplication, gene loss and deep coalescence costs [7].
From their relationship, one can easily obtain the fact
that the LCA reconciliation is the unique one with the optimal gene
loss cost from the fact that it is the unique one with the optimal
deep coalescence cost, but the reverse is not clear.

Gd-optimal reconciliations 

By Corollary 4, the LCA reconciliation is the unique
optimal reconciliation for the gene loss cost, as well as
the deep coalescence cost. However, the LCA reconciliation
may not be the unique optimal one for the gene
duplication cost (see [15]). For example, for the reconciliation
$f$ in Figure 1 and the LCA reconciliation $M$
between the gene tree and species tree in Figure 1, we
have $\mathrm{gd}(f) = \mathrm{gd}(M) = 2$. Since the reconciliations with
the minimum gene duplication cost, which we shall
refer to as gd-optimal reconciliations, may not be
unique, in this section we will present a characterization
of them, using the theoretical results developed above.

By Theorem 2, a reconciliation $f$ is gd-optimal if and
only if $\delta_f(u) = \delta_M(u)$ holds for each vertex $u$ in $G$. Based
on it, we will show that there exists a unique maximal
gd-optimal reconciliation $M^*$ so that $f$ is gd-optimal if
and only if $f \preceq M^*$ holds. The reconciliation $M^*$ between
a gene tree $G$ and a species tree $S$ can be constructed as
follows. For all $u \in V(G)$ with $\delta_M(u) = 0$, $M^*$ maps $u$ to
$M(u)$, i.e., $M^*(u) = M(u)$. For those $u \in V(G)$ with $\delta_M(u) = 1$,
we shall define $M^*(u)$ recursively. If $u = r(G)$, i.e., it
is the root of $G$, then $M^*(u)$ is defined as $r(S)$, the root
of $S$. Otherwise, $M^*(p(u))$ has been defined, and $M^*(u)$ is
defined as:

\[
M^*(u) := \begin{cases}
M^*(p(u)), & \text{if } \delta_M(p(u)) = 1, \\
\text{the largest vertex } s \text{ in } S \text{ satisfying } M(u) \le s < M^*(p(u)), & \text{otherwise}.
\end{cases}
\]

If $u$ is a vertex in $G$ such that $u \ne r(G)$, then $\delta_M(p(u)) = 0$
implies $M(u) < M(p(u)) \le M^*(p(u))$, hence the mapping
$M^*$ is well defined. In addition, $M^*$ is also a reconciliation
between $G$ and $S$. To see this, note that if $u$ is a
leaf in $G$, then we have $\delta_M(u) = 0$, which implies
$M^*(u) = M(u)$ and hence $M^*$ is leaf-preserving. On the other
hand, by the construction of $M^*$, it is order-preserving.
For example, for the gene tree and species tree in Figure
1, the reconciliation $M^*$ is defined as:

\[
M^*(a) = Y, \quad M^*(b) = M^*(c) = M^*(r) = R, \quad \text{and} \quad M^*(i) = i \text{ for } 1 \le i \le 5.
\]

In this example, it is not difficult to check that $\mathrm{gd}(f) = \mathrm{gd}(M^*)$
holds for all $f \preceq M^*$, which also follows directly
from the following general result.

Theorem 5 Given a gene tree $G$ and a species tree $S$, a
reconciliation $f$ is gd-optimal if and only if $M \preceq f \preceq M^*$
holds. In particular, $M^*$ is the unique maximal gd-optimal
reconciliation between $G$ and $S$.

Proof We need only to show that $\mathrm{gd}(f) = \mathrm{gd}(M)$ for a
reconciliation $f$ if and only if $M \preceq f \preceq M^*$ holds, because
this implies $M^*$ is indeed the unique maximal gd-optimal
reconciliation.

To show that $\mathrm{gd}(M) = \mathrm{gd}(f)$ holds for every reconciliation
$f$ with $M \preceq f \preceq M^*$, it suffices to prove $\mathrm{gd}(M^*) = \mathrm{gd}(M)$,
because together with Theorem 2, this implies
$\mathrm{gd}(f) = \mathrm{gd}(M) = \mathrm{gd}(M^*)$. To this end, we need only to show
$\delta_M(u) = \delta_{M^*}(u)$ for each internal vertex $u$ in $G$. Now fix
an internal vertex $u$ in $G$. Since $M \preceq M^*$, we have
$\delta_{M^*}(u) \ge \delta_M(u)$ by Theorem 2. If $\delta_M(u) = 1$, then we have
$\delta_{M^*}(u) = 1 = \delta_M(u)$. Therefore it remains to consider the
case $\delta_M(u) = 0$. By (ii) in Lemma 1, $\delta_M(u) = 0$ implies
$M(u_1) \ne M(u) \ne M(u_2)$. Together with the construction of
$M^*$, we have $M^*(u_1) \ne M^*(u) \ne M^*(u_2)$. Since
$M(u_i) \le M^*(u_i) < M^*(u)$ for $i = 1, 2$, we have:

\[
\mathrm{lca}(M(u_1), M(u_2)) \le \mathrm{lca}(M^*(u_1), M^*(u_2)) \le M^*(u) = M(u).
\]

By the construction of $M$, we know $\mathrm{lca}(M(u_1), M(u_2)) = M(u)$,
and hence:

\[
\mathrm{lca}(M^*(u_1), M^*(u_2)) = M^*(u).
\]

By (ii) in Lemma 1, this shows $\delta_{M^*}(u) = 0$, as required.



Wu and Zhang BMC Bioinformatics 2011, 12(Suppl 9):S7
http://www.biomedcentral.com/1471-2105/12/S9/S7






To establish the other direction, assume $\mathrm{gd}(f) = \mathrm{gd}(M)$
for a reconciliation $f$, and we shall show $f \preceq M^*$, i.e.,
$f(u) \le M^*(u)$ for each internal $u$ in $V(G)$. To this end, fix an
internal vertex $u$ in $G$, and denote its two children by $u_1$
and $u_2$. If $\delta_f(u) = 0$, then by (v) in Lemma 1 we have
$f(u) = M(u)$, and hence $f(u) = M^*(u)$. Therefore, it
remains to prove $f(u) \le M^*(u)$ for $\delta_f(u) = 1$, which will
be established by induction. The base case is $u$ being
the root of $G$; then $M^*(u)$ is the root of $S$, and $f(u) \le M^*(u)$
trivially holds. For the induction step, let $u_0 := p(u)$
be the parent of $u$; then the induction assumption is
$f(u_0) \le M^*(u_0)$. Now if $\delta_M(u_0) = 1$, then by the definition
of $M^*$ we have:

\[
f(u) \le f(u_0) \le M^*(u_0) = M^*(u).
\]

Otherwise, we have $\delta_M(u_0) = 0$. Together with $M \preceq f$,
$M \preceq M^*$ and $\mathrm{gd}(M) = \mathrm{gd}(f) = \mathrm{gd}(M^*)$, this leads to
$\delta_f(u_0) = \delta_{M^*}(u_0) = 0$ by Theorem 2. In view of (v) in Lemma 1,
we obtain $M(u_0) = f(u_0) = M^*(u_0)$. Since $\delta_f(u_0) = 0$, (ii)
in Lemma 1 implies $f(u) < f(u_0)$, and hence also:

\[
M(u) \le f(u) < f(u_0) = M(u_0).
\]

By definition, $M^*(u)$ is the largest vertex in the set
$\{s : M(u) \le s < M^*(u_0)\}$. Since $M^*(u_0) = M(u_0)$, we can conclude
$f(u) \le M^*(u)$, which completes the proof. Q.E.D.
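
The construction of $M^*$ can be sketched as one bottom-up pass (for $M$ and $\delta_M$) followed by one top-down pass. The code below is only an illustration: it uses a hypothetical caterpillar species tree and gene tree (not those of Figure 1), the same dictionary encoding as a working assumption, and function names of our own choosing.

```python
# Hypothetical toy data: species tree (((a,b)Y,c)X,d)R and
# gene tree (((g1,g2)w,g3)u,g4)r with two copies sampled from species a.
S_PARENT = {"a": "Y", "b": "Y", "Y": "X", "c": "X", "X": "R", "d": "R", "R": None}
G_CHILDREN = {"r": ("u", "g4"), "u": ("w", "g3"), "w": ("g1", "g2")}
G_LABEL = {"g1": "a", "g2": "a", "g3": "c", "g4": "d"}


def lca(s_parent, x, y):
    """Least common ancestor of x and y in the species tree (naive walk)."""
    ancestors = set()
    while x is not None:
        ancestors.add(x)
        x = s_parent[x]
    while y not in ancestors:
        y = s_parent[y]
    return y


def lca_reconciliation(g_children, g_label, s_parent, root):
    m = dict(g_label)

    def visit(u):
        if u not in m:
            u1, u2 = g_children[u]
            m[u] = lca(s_parent, visit(u1), visit(u2))
        return m[u]

    visit(root)
    return m


def delta(f, u, g_children, s_parent):
    """Duplication indicator via Lemma 1 (i); valid when f is a reconciliation."""
    a, b = (f[c] for c in g_children[u])
    return int(f[u] in (a, b) or lca(s_parent, a, b) != f[u])


def max_gd_optimal(g_children, g_label, s_parent, root):
    """Build M* top-down from M and the duplication indicators of M."""
    m = lca_reconciliation(g_children, g_label, s_parent, root)
    dup = {u: delta(m, u, g_children, s_parent) for u in g_children}
    s_root = next(s for s, p in s_parent.items() if p is None)
    m_star = dict(m)  # vertices with delta_M = 0 (including leaves) keep M(u)
    stack = [(root, None)]
    while stack:  # parents are always handled before their children
        u, parent = stack.pop()
        if u not in g_children:
            continue  # leaf
        if dup[u] == 1:
            if parent is None:
                m_star[u] = s_root
            elif dup[parent] == 1:
                m_star[u] = m_star[parent]
            else:
                # Largest s with M(u) <= s < M*(p(u)): climb from M(u) until
                # the next step up would reach M*(p(u)).
                s = m[u]
                while s_parent[s] != m_star[parent]:
                    s = s_parent[s]
                m_star[u] = s
        stack.extend((child, u) for child in g_children[u])
    return m, m_star


M, M_STAR = max_gd_optimal(G_CHILDREN, G_LABEL, S_PARENT, "r")
```

In this toy example only the duplication vertex `w` moves: $M$ maps it to the leaf `a`, whereas $M^*$ maps it to `Y`, the child of $M^*(u) = $ `X` on the path down to `a`; both maps have duplication cost 1.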

Enumerate nearly-optimal reconciliations 

Recall that reconciliations other than the LCA reconciliation
may also have the minimum duplication cost.
Moreover, in a biological study, a nearly-optimal reconciliation
could be the correct solution to its problem.
Therefore, it is of interest to study the following problem
[20]: Given a positive number $\varepsilon$, compute the set
of nearly-optimal reconciliations that have the duplication
cost less than or equal to $\mathrm{gd}(M) + \varepsilon$, where $\mathrm{gd}(M)$
is the minimum duplication cost a reconciliation
between the gene tree and the species tree can have.
Such a set of nearly-optimal reconciliations is
denoted by $\mathcal{T}_\varepsilon(G, S, \mathrm{gd})$, which is a subset of $\mathcal{T}(G, S)$,
the set of all reconciliations between $G$ and $S$.

In this section we present an algorithm for enumerating $T_\varepsilon(G, S, gd)$. To this end, we need some additional definitions. Following [20], for a vertex $u \in V(G)$, let $\mathrm{id}(u)$ be the number of vertices that precede $u$ in the prefix traversal of $G$, where the left child $u_l$ of a vertex $u \in V^{\circ}(G)$ is visited before the right child $u_r$. For a reconciliation $f$ in $T(G, S)$ and a vertex $u \in V^{\circ}(G)$ with $f(u) \neq r(S)$, $f[u]$ is the mapping defined as:

\[
f[u](v) =
\begin{cases}
p(f(u)) & \text{if } v = u, \\
f(v) & \text{otherwise.}
\end{cases}
\]

For an internal vertex $u$ with $u \neq r(G)$, $f[u]$ is a reconciliation if and only if $f(u) < f(p(u))$; for the root $r(G)$ of $G$, $f[r(G)]$ is a reconciliation if and only if $f(r(G)) < r(S)$. In both cases, we say that the reconciliation $f[u]$ is obtained from $f$ by applying a Nearest Mapping Change (NMC) operator on $u$; this operator is adapted from the one introduced in [20]. Similarly, we can define $f[u_1, \ldots, u_k]$ for a sequence of (not necessarily distinct) vertices in $G$. Note that for a reconciliation $f$ in $T(G, S)$ with $f \neq M$, there exists a unique sequence $u_1, \ldots, u_k$ such that $f = M[u_1, \ldots, u_k]$ and $\mathrm{id}(u_i) \le \mathrm{id}(u_{i+1})$ for $i = 1, \ldots, k - 1$; $\mathrm{id}(f)$ is then defined as $\mathrm{id}(u_k)$, where $u_k$ is the last vertex in this sequence. For completeness, we use the convention $\mathrm{id}(M) = 0$. Finally, for a reconciliation $f$ in $T(G, S)$, we set:

\[
K(f) := \{\, u \in V^{\circ}(G) : f[u] \text{ is in } T(G, S) \text{ and } \mathrm{id}(u) \ge \mathrm{id}(f) \,\},
\]

where $K(f)$ is regarded as an ordered list (with the order induced by $\mathrm{id}$).

The NMC operator induces a tree structure on the set $T(G, S)$: the root is $M$, and $f'$ is a child of $f$ if and only if $f' = f[u]$ for some $u \in K(f)$. This tree, whose vertex set is $T(G, S)$, will be denoted by $\Gamma(G, S)$. The idea of considering a tree structure on the space of reconciliations was introduced in [20]. By Theorem 2, the restriction of $\Gamma(G, S)$ to $T_\varepsilon(G, S, gd)$ is a subtree, which will be referred to as $\Gamma_\varepsilon(G, S, gd)$. We can now state our algorithm, which enumerates $T_\varepsilon(G, S, gd)$ by a traversal of $\Gamma_\varepsilon(G, S, gd)$. Here $\sqcup$ stands for disjoint union.

Algorithm for enumerating nearly-optimal reconciliations

Input: A gene tree $G$ and a species tree $S$ with $L(G) \subseteq L(S)$, and $\varepsilon \ge 0$.

Output: The set $T_\varepsilon(G, S, gd)$ of nearly-optimal reconciliations.

1: Construct the LCA reconciliation $M$ between $G$ and $S$.
2: Set $T := \{M\}$, $\mathrm{id}(M) := 0$, $\Delta(M) := 0$, $B := \{M\}$.
3: While $B \neq \emptyset$, set $B' := \emptyset$ and do {
4: For each $f \in B$, do {
5: For each node $u \in K(f)$, construct the map $f' := f[u]$, and set $\mathrm{id}(f') := \mathrm{id}(u)$.
5a: Calculate
\[
\Delta(f') :=
\begin{cases}
\Delta(f) + \delta_{f'}(u) - \delta_{f}(u), & \text{if } u = r(G); \\
\Delta(f) + \delta_{f'}(u) + \delta_{f'}(p(u)) - \delta_{f}(u) - \delta_{f}(p(u)), & \text{otherwise.}
\end{cases}
\]
5b: If $\Delta(f') \le \varepsilon$, set $B' \leftarrow \{f'\} \sqcup B'$. } end do
6: Set $T \leftarrow T \sqcup B'$ and $B \leftarrow B'$. } end do.
7: Output $T$.
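
As an illustration only, the level-by-level traversal described by the algorithm can be sketched in Python. The sketch rests on several assumptions that are not taken from the text: trees are given as nested tuples, every vertex is named by its prefix id, gene leaves carry the name of their species, and all function names are hypothetical. Most importantly, the duplication indicator $\delta_f(u)$ is assumed to be 0 exactly when $f(u)$ is the LCA of the images of the two children of $u$ and differs from both images; this convention may differ in detail from the definition used earlier in the paper. The indicator is recomputed with LCA queries, so the sketch does not achieve the constant-time update discussed in the text.

```python
from itertools import count

def build(tree):
    """Flatten a nested-tuple binary tree; each vertex is named by its prefix id."""
    parent, kids, label = {}, {}, {}
    ids = count()
    def walk(node, par):
        v = next(ids)                      # number of vertices visited before v
        parent[v] = par
        if isinstance(node, tuple):
            kids[v] = [walk(child, v) for child in node]   # left child first
        else:
            kids[v], label[v] = [], node
        return v
    walk(tree, None)
    return parent, kids, label

def lca(parent, a, b):
    """Least common ancestor of a and b in the tree given by its parent table."""
    above_a = set()
    while a is not None:
        above_a.add(a)
        a = parent[a]
    while b not in above_a:
        b = parent[b]
    return b

def lca_reconciliation(G, S):
    """Map gene leaves to their species leaf and internal vertices to LCAs."""
    (gpar, gkids, glab), (spar, _, slab) = G, S
    species_leaf = {name: v for v, name in slab.items()}
    M = {}
    for u in sorted(gpar, reverse=True):   # children have larger ids than parents
        if gkids[u]:
            left, right = gkids[u]
            M[u] = lca(spar, M[left], M[right])
        else:
            M[u] = species_leaf[glab[u]]
    return M

def delta(G, S, f, u):
    """ASSUMED duplication indicator: 0 only if f(u) is the LCA of the images
    of u's two children and differs from both images; otherwise 1."""
    left, right = G[1][u]
    x = lca(S[0], f[left], f[right])
    return 0 if (f[u] == x and x not in (f[left], f[right])) else 1

def enumerate_nearly_optimal(G, S, eps):
    """Breadth-first traversal from M, keeping maps whose extra cost is <= eps."""
    gpar, gkids, _ = G
    spar = S[0]
    internal = [u for u in sorted(gpar) if gkids[u]]
    M = lca_reconciliation(G, S)
    found, frontier = [M], [(M, 0, 0)]     # entries are (f, id(f), Delta(f))
    while frontier:
        new = []
        for f, fid, cost in frontier:
            for u in internal:
                p = gpar[u]
                # u is in K(f) iff id(u) >= id(f) and f(u) can still move up
                if u < fid or spar[f[u]] is None or (p is not None and f[u] == f[p]):
                    continue
                g = dict(f)
                g[u] = spar[f[u]]          # NMC operator: f[u](u) = p(f(u))
                touched = [u] if p is None else [u, p]
                c = cost + sum(delta(G, S, g, v) - delta(G, S, f, v) for v in touched)
                if c <= eps:
                    new.append((g, u, c))
        found += [g for g, _, _ in new]
        frontier = new
    return found

tree = ((("a", "b"), "c"), "d")            # G and S share this shape
G, S = build(tree), build(tree)
sizes = [len(enumerate_nearly_optimal(G, S, eps)) for eps in (0, 2, 10)]
print(sizes)                               # prints [1, 3, 5]
```

Under the assumed indicator, the four-leaf example returns only $M$ for slack 0, three reconciliations for slack 2, and all five order-preserving maps once the slack is large enough. The id test `u < fid` is what makes each reconciliation appear exactly once.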



To see the running time of the above algorithm, note first that for a reconciliation $f$, $K(f)$ is a subset of $V^{\circ}(G)$, and for each $u \in V^{\circ}(G)$, whether $u \in K(f)$ can be determined in constant time when $\mathrm{id}(u)$ and $\mathrm{id}(f)$ are known. In addition, if $\delta_M$ is given, then lines 5a and 5b can be computed in constant time; the proof of this observation will be presented in the full version of this paper. Therefore, the above algorithm runs in time $O(|V(G)| \cdot |T_\varepsilon(G, S, gd)|)$, plus additional preprocessing time to compute $\mathrm{id}(u)$ and $\delta_M(u)$ for each $u \in V^{\circ}(G)$.

Two facts prevent us from designing a better algorithm for the enumeration problem. The first concerns the boundary set $B_\varepsilon(G, S, gd)$, which consists of all reconciliations $f'$ in $T(G, S) \setminus T_\varepsilon(G, S, gd)$ such that, for some $f \in \Gamma_\varepsilon(G, S, gd)$, $f'$ is a child of $f$ in $\Gamma(G, S)$. In order to enumerate $T_\varepsilon(G, S, gd)$, an algorithm typically needs to visit not only the reconciliations in $T_\varepsilon(G, S, gd)$ but also those in $B_\varepsilon(G, S, gd)$. However, $|B_\varepsilon(G, S, gd)|$ could be as large as $O(|V(G)| \cdot |T_\varepsilon(G, S, gd)|)$. For instance, if $G$ and $S$ have the same tree structure on $n + 1$ leaves, then $T_0(G, S, gd) = \{M\}$ but $B_0(G, S, gd)$ contains $n - 1$ reconciliations. Furthermore, we have $|T_2(G, S, gd)| = n$ and $|B_2(G, S, gd)| = O(n^2)$.

The other concern is about the set $K_\varepsilon(f) := \{\, u \in V(G) : f[u] \text{ is in } \Gamma_\varepsilon(G, S, gd) \text{ and } \mathrm{id}(u) \ge \mathrm{id}(f) \,\}$, which is needed if we want to explore $T_\varepsilon(G, S, gd)$ without visiting the boundary set $B_\varepsilon(G, S, gd)$. However, some properties of the two sets $K(f)$ and $K_\varepsilon(f)$ differ. For instance, the following property of $K(f)$ is crucial to the optimal algorithm for exploring $T(G, S)$ (see Property 5 and Proposition 4 in [20]): if $u$ is the first vertex in $K(f)$ and $f' = f[u]$, then $K(f) \setminus K(f') \subseteq \{u\}$. However, this does not hold for $K_\varepsilon$. To see this, consider the example mentioned in the previous paragraph and denote the first child of $r(G)$ by $r_1$; then we have $K_2(M) = V^{\circ}(G) \setminus \{r(G)\}$ while $K_2(M[r_1]) = \emptyset$.

Since $T_0(G, S, gd)$ consists of the gd-optimal reconciliations, the above algorithm also provides a method for enumerating all the optimal reconciliations between a gene tree and a species tree. Since $\Gamma_\varepsilon(G, S, gl)$, as well as $\Gamma_\varepsilon(G, S, dc)$, is also a subtree of $\Gamma(G, S)$, the algorithm can be modified to list nearly-optimal reconciliations with respect to the gene loss or deep coalescence cost. Due to limited space, the details of these algorithms are omitted here; the reader is referred to the full version of this work, available on our personal website. As ongoing work, the algorithms presented here will be coded in C++ and evaluated by comparing them with existing ones on simulated data.

Conclusions 

To investigate all reconciliations between a gene tree 
and a species tree, we have generalized the LCA recon- 
ciliation to define an arbitrary reconciliation as a vertex 
mapping from the gene tree to the species tree. This 
provides a new framework for investigating various 
mathematical issues of the reconciliation space. It allows 
us to give a unified approach to study reconciliations 
with each of the cost models. As applications, we show 
that the LCA reconciliation is the unique one having 
the smallest deep coalescence cost, and present a char- 
acterization of the reconciliations with the minimum 
gene duplication cost; we also develop efficient algorithms to enumerate nearly-optimal reconciliations under each cost model. In the future, we shall incorporate other evolutionary forces behind gene tree heterogeneity, such as horizontal gene transfer and recombination, into this framework.